Inverted Bilingual Topic Models for Lexicon Extraction from Non-parallel Data
نویسندگان
چکیده
Topic models have been successfully applied in lexicon extraction. However, most previous methods are limited to document-aligned data. In this paper, we try to address two challenges of applying topic models to lexicon extraction in non-parallel data: 1) hard to model the word relationship and 2) noisy seed dictionary. To solve these two challenges, we propose two new bilingual topic models to better capture the semantic information of each word while discriminating the multiple translations in a noisy seed dictionary. We extend the scope of topic models by inverting the roles of ”word” and ”document”. In addition, to solve the problem of noise in seed dictionary, we incorporate the probability of translation selection in our models. Moreover, we also propose an effective measure to evaluate the similarity of words in different languages and select the optimal translation pairs. Experimental results using real world data demonstrate the utility and efficacy of the proposed models.
منابع مشابه
A Combination of Models for Bilingual Lexicon Extraction from Comparable Corpora
In this paper we present a method to extract bilingual terminologies from comparable non-aligned corpora, by using multiple linguistic knowledge sources, such as: non-parallel corpora, bilingual thesauri, a preliminary bilingual dictionary, etc... We focus on two core technologies: bilingual lexicon extraction from comparable corpora and expansion through thesauri categories based on different ...
متن کاملUtilizing Contextually Relevant Terms in Bilingual Lexicon Extraction
This paper demonstrates one efficient technique in extracting bilingual word pairs from non-parallel but comparable corpora. Instead of using the common approach of taking high frequency words to build up the initial bilingual lexicon, we show contextually relevant terms that co-occur with cognate pairs can be efficiently utilized to build a bilingual dictionary. The result shows that our model...
متن کاملPivot, Box and Trilingual: Lexicon Extraction for Low-Resource Language Pairs with Extended Topic Models
Data-driven approaches to natural language processing have been shown to be greatly effective, and the case of bilingual lexicon extraction is no exception. While training data is readily available for many language pairs, many existing approaches fail for languages for which there simply does not exist parallel data. While there have been many studies on bilingual lexicon extraction, there has...
متن کاملBilingual Distributed Word Representations from Document-Aligned Comparable Data
We propose a new model for learning bilingual word representations from non-parallel document-aligned data. Following the recent advances in word representation learning, our model learns dense real-valued word vectors, that is, bilingual word embeddings (BWEs). Unlike prior work on inducing BWEs which heavily relied on parallel sentence-aligned corpora and/or readily available translation reso...
متن کاملPivot-Based Topic Models for Low-Resource Lexicon Extraction
This paper proposes a range of solutions to the challenges of extracting large and highquality bilingual lexicons for low-resource language pairs. In such scenarios there is often no parallel or even comparable data available. We design three effective pivotbased approaches inspired by the state-ofthe-art technique of bilingual topic modelling, extending previous work to take advantage of trili...
متن کامل